Abstract


A common way of constructing a multiclass classifier is by combining the outputs of several binary 
ones, according to an error-correcting output code (ECOC) scheme. The combination is typically 
done via a simple nearest-neighbor rule that finds the class that is closest in some sense to the 
outputs of the binary classifiers. For these nearest-neighbor ECOCs, we improve existing bounds on 
the error rate of the multiclass classifier given the average binary distance. The new bounds provide 
insight into the one-versus-rest and all-pairs matrices, which are compared through experiments 
with standard datasets. The results also show why elimination (also known as DAGSVM) and 
Hamming decoding often achieve the same accuracy. 

Keywords: Error-correcting output codes, all-pairs ECOC matrix, multiclass support vector machines


1. Introduction 


Several techniques for constructing binary classifiers with good generalization capabilities were de- 
veloped in recent years, e.g., support vector machines (SVM) (Cortes and Vapnik, 1995). However, 
in many applications the number of classes is larger than two. While multiclass versions of most 
classification algorithms exist (e.g., Crammer and Singer, 2002), they tend to be complex (Hsu and 
Lin, 2002). A more common approach is to construct the multiclass classifier by combining the 
outputs of several binary ones (Dietterich and Bakiri, 1995, Allwein et al., 2000). Typically, the 
combination is done via a simple nearest-neighbor rule, which finds the class that is closest in some 
sense to the outputs of the binary classifiers. 

The most traditional scheme for solving a multiclass problem with binary classifiers is based on
the so-called one-versus-rest matrix. However, the popularity of an alternative scheme based on the
all-pairs matrix (also known as 1 versus 1, round-robin and pairwise decomposition) has recently
increased (see, e.g., Fürnkranz, 2002). All-pairs with Hamming decoding is related to well-known
methods of paired comparisons in statistics (David, 1963), and it was first applied to classification
problems by Friedman (1996).

There are theoretical results that compare some aspects of the all-pairs and one-versus-rest 
(among other) matrices. These results also suggest guidelines for constructing accurate multiclass 
classifiers. For example, recent work has used the error incurred by the binary classifiers to upper
bound the error committed by the combined nearest-neighbor classifier (Guruswami and Sahai,
1999, Allwein et al., 2000). These results are reviewed and expanded here.

©2003 Aldebaro Klautau, Nikola Jevtić and Alon Orlitsky.

KLAUTAU, JEVTIĆ AND ORLITSKY

We present theoretical and experimental contributions. We strengthen the bounds by Allwein 
et al. (2000) and extend the class of decoders to which they apply. These improved bounds provide 
insight into the properties of certain ECOC matrices when the number of classes is large. We also 
conduct detailed experiments directly comparing ECOC schemes that use the all-pairs and one- 
versus-rest matrices for solving multiclass problems with SVM, complementing previous work (e.g., 
Allwein et al., 2000, Hsu and Lin, 2002). Our results show that Hamming decoding is very effective 
for all-pairs. Additionally, our experimental results explain why elimination (Kreßel, 1999) (also 
known as DAGSVM) and Hamming decoding often achieve similar accuracy. 

The paper is organized as follows. Section 2 briefly reviews the construction
of multiclass classifiers from binary ones and establishes the notation. Theoretical bounds for the
multiclass error are presented in Section 3. Experimental results are presented in Section 4, followed
by conclusions in Section 5.


2. Background on ECOC 


In supervised classification problems, one is given a training set $\{(x_1,y_1),\ldots,(x_N,y_N)\}$ containing
$N$ examples. Each example $(x,y)$ consists of an instance $x \in X$ and a label $y \in \{1,\ldots,K\}$, where $X$ is
the instance space and $K \ge 2$ is the number of classes. A classifier is a mapping $F : X \to \{1,\ldots,K\}$
from instances to labels. For binary problems ($K = 2$ classes) the examples are labeled $-1$ and
$+1$, for convenience. We assume the base learner is class-symmetric, i.e., the learning problem is
equivalent if we exchange the labels $-1$ and $+1$, and we are especially interested in confidence-valued
binary classifiers $f : X \to \mathbb{R}$ that return a score.

One of the most successful methods for constructing multiclass classifiers is to combine the outputs
of several binary classifiers. First, a collection $f_1,\ldots,f_B$ of $B$ binary classifiers is constructed,
where each classifier is trained to distinguish between two subsets of classes. The classes involved
in the training of the binary classifiers are typically specified by a matrix $M \in \{-1,1\}^{K \times B}$ (Dietterich
and Bakiri, 1995) or $M \in \{-1,0,1\}^{K \times B}$ (Allwein et al., 2000), and classifier $f_b$ is trained
according to column $M(\cdot,b)$.

The K-ary classifier $F$ takes the scores $f(x) = (f_1(x),\ldots,f_B(x))$ and combines them using a
function $g : \mathbb{R}^B \to \{1,\ldots,K\}$ to obtain $F(x) = g(f(x))$. One can view the rows of the matrix $M$ as
codewords and the function $g$ as decoding the output $f(x)$ of the binary classifiers. By analogy to
coding,¹ $M$ is referred to as an ECOC matrix and the function $g$ is called the decoder. We call the
combination of a matrix $M$ and a decoder $g$ an ECOC scheme or simply ECOC.

In spite of the appeal of matrices inspired by coding, the most popular ECOC matrices are
obtained by simply taking all combinations of $\alpha$ versus (vs.) $\beta$ classes, $\alpha + \beta \le K$, where each
binary classifier is trained to distinguish $\alpha$ positive from $\beta$ negative classes. We will be especially
interested in the 1 vs. 1 (all-pairs) and 1 vs. $K-1$ (one-vs-rest) matrices. The one-vs-rest matrix
induces $B = K$ binary classifiers $f_1,\ldots,f_K$. The all-pairs matrix induces $B = \binom{K}{2}$ binary classifiers
$f_{i,j}$, $1 \le i < j \le K$.
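
As an illustration, the two matrices can be generated with a few lines of Python. This is only a sketch: the function names are hypothetical, classes are indexed from 0, and the sign convention for all-pairs (the lower-indexed class of each pair is the positive one) is an assumption made here, not something the text fixes.

```python
from itertools import combinations


def one_vs_rest_matrix(K):
    """K x K matrix with +1 on the diagonal and -1 elsewhere."""
    return [[1 if k == b else -1 for b in range(K)] for k in range(K)]


def all_pairs_matrix(K):
    """K x C(K,2) matrix; column (i, j) has +1 in row i, -1 in row j, 0 elsewhere.

    Assumption of this sketch: the lower-indexed class of a pair is positive.
    """
    pairs = list(combinations(range(K), 2))
    return [[1 if k == i else -1 if k == j else 0 for (i, j) in pairs]
            for k in range(K)]


if __name__ == "__main__":
    for row in all_pairs_matrix(4):
        print(row)
```

Each row of the all-pairs matrix has $K-1$ non-zero entries, one for every pair the class takes part in; the remaining entries are zero.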


1. Many results from coding can be promptly used for ECOCs, especially when $M \in \{-1,1\}^{K \times B}$. For example,
Theorem 2 by Berger (1999) corresponds to the well-known Plotkin's bound, which states that $\rho \le (0.5BK)/(K-1)$,
where $\rho$ is the minimum Hamming distance between two distinct rows of $M$.

ECOC schemes often adopt a nearest-neighbor decoder. These decoders use a distance measure
$d : \mathbb{R}^B \times \{-1,0,1\}^B \to \mathbb{R}$, and select the class $F(x) = \arg\min_k d(f(x),M(k,\cdot))$ that minimizes the
distance between scores $f(x)$ and row $M(k,\cdot)$. Of special interest are loss-based distances (Allwein
et al., 2000), which are defined by

$$d(f(x),M(k,\cdot)) = \sum_{b=1}^{B} L(z_{k,b}), \qquad (1)$$

where $L : \mathbb{R} \to \mathbb{R}$ is a loss function,² and $z_{k,b} = f_b(x)M(k,b)$ would be the margin under classifier
$f_b$ if $M(k,b)$ were the label of instance $x$.

It is easy to show that, for any ECOC with loss-based decoding, all linear loss functions $L(z) =
c_1 + c_2 z$ with negative (positive) $c_2$ lead to the same classification result. If the binary learner is
an SVM, decoding with $L(z) = (1-z)_+$, where $(x)_+ = \max\{x,0\}$, has the appeal of matching the
criterion used to maximize the margin when training the SVMs (Allwein et al., 2000).

The natural decoding method for an ECOC with the one-vs-rest matrix is to select the class $k$
that maximizes the score $f_k(x)$. This decoder is called max-wins. It can be shown that for the
one-vs-rest matrix, several choices of $L$ lead to the same classification result as max-wins, such as
$L(z) = (1-z)_+$ or any strictly decreasing function $L$.
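
The equivalence between loss-based decoding with $L(z) = (1-z)_+$ and max-wins for the one-vs-rest matrix can be checked numerically. The sketch below is illustrative only; the function names are hypothetical, classes are indexed from 0, and the scores are drawn from a grid of distinct values so that no ties occur.

```python
import random


def hinge(z):
    """L(z) = (1 - z)_+."""
    return max(1.0 - z, 0.0)


def loss_based_decode(scores, M, loss):
    """Return the row index k minimizing sum_b loss(f_b(x) * M[k][b])."""
    dists = [sum(loss(f * m) for f, m in zip(scores, row)) for row in M]
    return min(range(len(M)), key=dists.__getitem__)


def max_wins(scores):
    """Return the index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    K = 5
    M = [[1 if k == b else -1 for b in range(K)] for k in range(K)]  # one-vs-rest
    rng = random.Random(0)
    for _ in range(1000):
        # Distinct scores on a grid of multiples of 1/8 (no ties, exact arithmetic).
        f = [v / 8 for v in rng.sample(range(-24, 25), K)]
        assert loss_based_decode(f, M, hinge) == max_wins(f)
    print("hinge-loss decoding agreed with max-wins on all random trials")
```

The agreement follows because, for one-vs-rest, the distance to row $k$ equals a term common to all rows plus $L(f_k(x)) - L(-f_k(x))$, which is strictly decreasing in $f_k(x)$ for the hinge loss.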

Instead of scores, the binary classifiers may return hard decisions $h(x) \in \{-1,1\}^B$, or the results
of the binary classifiers may be quantized to $\{-1,1\}$ to overcome unreliability. A natural decoder
in these cases is the Hamming decoder (also known as voting), the nearest-neighbor decoder that
minimizes the Hamming distance (modified to allow for $M(k,b) = 0$):

$$d_H(h(x),M(k,\cdot)) = 0.5 \sum_{b=1}^{B} \left(1 - h_b(x)M(k,b)\right). \qquad (2)$$

Hamming distance is a special case of a loss-based distance where $L(z) = (1 - \mathrm{sign}(z))/2$ (Allwein
et al., 2000).
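
A minimal sketch of the Hamming decoder follows, using the all-pairs matrix for $K = 3$ as a toy input. The function names are hypothetical, and ties are broken in favor of the lowest class index, which is an arbitrary choice made for this example.

```python
def hamming_distance(h, row):
    """Modified Hamming distance; each zero entry of the row contributes 0.5."""
    return 0.5 * sum(1 - hb * mb for hb, mb in zip(h, row))


def hamming_decode(h, M):
    """Index of the row of M closest to the hard decisions h (first index wins ties)."""
    dists = [hamming_distance(h, row) for row in M]
    return min(range(len(M)), key=dists.__getitem__)


if __name__ == "__main__":
    # All-pairs matrix for K = 3; columns correspond to the pairs (0,1), (0,2), (1,2).
    M = [[1, 1, 0], [-1, 0, 1], [0, -1, -1]]
    h = [1, 1, 1]
    print([hamming_distance(h, row) for row in M])
    print(hamming_decode(h, M))
```

In this example the first class wins both of its comparisons, so its distance is just the 0.5 contributed by its single zero entry.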

We are mostly concerned with nearest-neighbor decoders, for which theoretical results are presented
in Section 3. In general however, the decoder $g$ can be any mapping, such as the one obtained
with a stacked artificial neural network (Klautau et al., 2002). Another example of a
non-nearest-neighbor decoder is the one proposed by Moreira and Mayoraz (1998). They adopted the all-pairs
matrix and a decoding method equivalent to using Equation (1) with $L(z') = -z'$, where $z'$ is a
weighted margin $z'_{k,b} = \omega_b f_b(x)M(k,b)$. The weight $\omega_b$ of classifier $f_b$ was obtained with an additional
ECOC based on a 2 vs. $K-2$ matrix.


3. Bounds on the K-ary Error 


Previous work (Guruswami and Sahai, 1999, Allwein et al., 2000) used the error, and more generally,
distance, incurred by the binary classifiers, to upper bound the error committed by the K-ary
classifier with nearest-neighbor decoding. This section strengthens these bounds, extends the distance
measures to which they apply and provides some insight into the properties of $\alpha$ vs. $\beta$ ECOC
matrices when $K$ is large. We begin by discussing results for any distance $d$. Then we specialize
the bounds for the cases where $d$ is the Hamming distance, and later for ECOCs with Hamming
distance and $\alpha$ vs. $\beta$ matrix.


2. Allwein et al. (2000) defined $L : \mathbb{R} \to [0,\infty)$, but here we extend the range of $L$ to allow for, e.g., $L(z) = -z$, used
by Zadrozny (2002).




3.1 Bounds for Nearest-Neighbor with General Distance 


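The number of errors the K-ary classifier $F$ commits on a given set (e.g., training or held-out set)
with $N$ examples is $\varepsilon_K \stackrel{\mathrm{def}}{=} |\{n : F(x_n) \ne y_n\}|$ and its error rate is

$$\bar{\varepsilon}_K \stackrel{\mathrm{def}}{=} \frac{\varepsilon_K}{N}.$$

The accumulated distance between the outputs of the binary classifiers and the correct codeword
over this set is $D \stackrel{\mathrm{def}}{=} \sum_{n=1}^{N} d(f(x_n),M(y_n,\cdot))$ and their average distance is

$$\bar{D} \stackrel{\mathrm{def}}{=} \frac{D}{N}.$$

To relate $\bar{\varepsilon}_K$ and $\bar{D}$, the minimum Hamming distance between any two rows was defined by
Allwein et al. (2000) to be

$$\rho \stackrel{\mathrm{def}}{=} \min\{d_H(M(k,\cdot),M(k',\cdot)) : k \ne k'\},$$

where $d_H$ is defined in Equation (2). For example, for one-vs-rest $\rho = 2$, and for all-pairs $\rho =
(B+1)/2$. Allwein et al. (2000) also used

$$L^* \stackrel{\mathrm{def}}{=} \min_{z \in \mathbb{R}} \left\{ \frac{L(z) + L(-z)}{2} \right\}$$

to prove some of their results.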

Here we use the following two definitions related to distances between scores $f(x)$ and rows of
$M$. Given an ECOC matrix $M$, a distance measure $d$, and a vector $f \in \mathbb{R}^B$, let $d_1(f)$ be the smallest
distance between $f$ and any row of $M$, and let $d_2(f) \ge d_1(f)$ be the smallest distance between $f$ and
the remaining rows of $M$. Define

$$d_1 = \min_{f \in \mathbb{R}^B} d_1(f) \quad \text{and} \quad d_2 = \min_{f \in \mathbb{R}^B} d_2(f).$$

For example, for an ECOC with the one-vs-rest matrix and Hamming decoding, $d_1 = 0$ (achieved
when $f(x) = h(x)$ matches a codeword) and $d_2 = 1$ (achieved when $f(x) = h(x)$ contains two
elements $+1$ while the others are $-1$). And for an ECOC with the all-pairs matrix and Hamming
decoding, $d_1 = 0.5\binom{K-1}{2}$ and $d_2 = d_1 + 1$. The following result was originally presented by Allwein
et al. (2000), and is restated here using $d_2$.


Lemma 1 (Implicit in Theorem 1 of Allwein et al., 2000) For any ECOC with loss-based
decoding using $L : \mathbb{R} \to [0,\infty)$,

$$d_2 \ge \rho L^*.$$
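
For small $K$ the quantities $\rho$, $d_1$ and $d_2$ under Hamming decoding can be found by exhaustive search over all hard-decision vectors, which gives a quick sanity check of the examples above and of Lemma 1. The sketch below is illustrative; the function names are hypothetical, and $L^* = 0.5$ is the value for the Hamming loss $L(z) = (1 - \mathrm{sign}(z))/2$.

```python
from itertools import combinations, product


def hamming_distance(u, v):
    """Modified Hamming distance; a zero in either vector contributes 0.5."""
    return 0.5 * sum(1 - a * b for a, b in zip(u, v))


def one_vs_rest_matrix(K):
    return [[1 if k == b else -1 for b in range(K)] for k in range(K)]


def all_pairs_matrix(K):
    pairs = list(combinations(range(K), 2))
    return [[1 if k == i else -1 if k == j else 0 for (i, j) in pairs]
            for k in range(K)]


def rho(M):
    """Minimum Hamming distance between two distinct rows of M."""
    return min(hamming_distance(r, s) for r, s in combinations(M, 2))


def d1_d2(M):
    """Brute-force d1 and d2 over all hard-decision vectors h in {-1,+1}^B."""
    d1 = d2 = float("inf")
    for h in product((-1, 1), repeat=len(M[0])):
        nearest, second = sorted(hamming_distance(h, row) for row in M)[:2]
        d1, d2 = min(d1, nearest), min(d2, second)
    return d1, d2


if __name__ == "__main__":
    L_STAR = 0.5  # L* for the Hamming loss
    for name, M in (("one-vs-rest", one_vs_rest_matrix(4)),
                    ("all-pairs", all_pairs_matrix(4))):
        d1, d2 = d1_d2(M)
        print(name, "rho =", rho(M), "d1 =", d1, "d2 =", d2,
              "d2 >= rho * L*:", d2 >= rho(M) * L_STAR)
```

The search is exponential in $B$, so it is only meant for toy sizes such as $K = 4$.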





According to Lemma 1, whenever an example $(x,y)$ leads to an error, the total error $\varepsilon_K$ is
incremented by 1 and at least $\rho L^*$ is added to the distance $D$. This reasoning can be used to interpret the
bound

$$\bar{\varepsilon}_K \le \frac{\bar{D}}{\rho L^*}, \qquad (3)$$

which is proved by Allwein et al. (2000) for any ECOC with loss-based decoding using $L : \mathbb{R} \to
[0,\infty)$. Using the definitions of $d_1$ and $d_2$, Equation (3) can be strengthened as follows.

Theorem 2 For any ECOC with nearest-neighbor decoding,

$$\bar{\varepsilon}_K \le \frac{\bar{D} - d_1}{d_2 - d_1}.$$

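Proof Split the set of instances into

$$C \stackrel{\mathrm{def}}{=} \{(x_n,y_n) : F(x_n) = y_n\} \quad \text{and} \quad W \stackrel{\mathrm{def}}{=} \{(x_n,y_n) : F(x_n) \ne y_n\},$$

containing the correctly and wrongly classified examples, respectively. The accumulated distance
$D$ can then be written as

$$D = \sum_{(x_n,y_n) \in C} d(f(x_n),M(y_n,\cdot)) + \sum_{(x_n,y_n) \in W} d(f(x_n),M(y_n,\cdot)).$$

The total number of errors is $\varepsilon_K = |W|$, hence the first part is at least $|C|d_1 = (N - \varepsilon_K)d_1$, and the
second is at least $|W|d_2 = \varepsilon_K d_2$. Therefore $D \ge (N - \varepsilon_K)d_1 + \varepsilon_K d_2$. Normalizing by $N$ and solving
for $\bar{\varepsilon}_K$, we obtain the theorem.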
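
The bound of Theorem 2 can be illustrated on synthetic data. The sketch below is not one of the experiments reported in this paper: it assumes the one-vs-rest matrix with Hamming decoding (so $d_1 = 0$ and $d_2 = 1$) and generates hard decisions by flipping each entry of the correct codeword independently with a hypothetical probability `flip_prob`.

```python
import random


def hamming_distance(h, row):
    return 0.5 * sum(1 - a * b for a, b in zip(h, row))


def simulate_one_vs_rest(K, N, flip_prob, seed=0):
    """Return (K-ary error rate, average distance to the correct codeword)."""
    rng = random.Random(seed)
    M = [[1 if k == b else -1 for b in range(K)] for k in range(K)]
    errors, total = 0, 0.0
    for _ in range(N):
        y = rng.randrange(K)
        # Synthetic hard decisions: the correct codeword with random sign flips.
        h = [-m if rng.random() < flip_prob else m for m in M[y]]
        dists = [hamming_distance(h, row) for row in M]
        pred = min(range(K), key=dists.__getitem__)  # first index wins ties
        errors += pred != y
        total += dists[y]
    return errors / N, total / N


if __name__ == "__main__":
    d1, d2 = 0.0, 1.0  # one-vs-rest with Hamming decoding
    err, dbar = simulate_one_vs_rest(K=6, N=5000, flip_prob=0.1)
    bound = (dbar - d1) / (d2 - d1)
    print(f"error rate {err:.3f} <= bound {bound:.3f}: {err <= bound}")
```

The inequality holds for every run, regardless of the seed or the tie-breaking rule, because each misclassified example contributes at least $d_2$ to the accumulated distance.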














We note that Theorem 2 applies to all distance measures, not just the loss-based ones with $L(z) \ge
0$ and $L^* > 0$, as required for Equation (3). Also, for loss-based distances, and when Equation (3)
is applicable and not trivial (i.e., $\bar{D} < \rho L^*$), Theorem 2 is always at least as strong as Equation (3)
because $\bar{D} < \rho L^* \le d_2$, and hence, $\forall d_1 \ge 0$,

$$\frac{\bar{D} - d_1}{d_2 - d_1} \le \frac{\bar{D}}{d_2} \le \frac{\bar{D}}{\rho L^*}.$$


A special case of interest is when the distance $d$ obeys the triangle inequality. If $d_{\min}$ is the
minimum distance between two rows of $M$, by the triangle inequality $d_2 \ge d_{\min}/2$. For example,
for any ECOC with Hamming decoding ($d_{\min} = \rho$) and a matrix $M$ without zero entries ($d_1 = 0$),
$\bar{\varepsilon}_K \le 2\bar{D}/\rho$, because the Hamming distance obeys the triangle inequality. For Hamming decoding it
is also possible to write $\bar{D}$ in terms of the binary error rate, as discussed in the next subsection.


3.2 Bounds for Hamming Decoding 


For Hamming decoding, Allwein et al. (2000) presented a more natural form of Equation (3), which
relates the K-ary classifier's error $\bar{\varepsilon}_K$ to that committed by the binary classifiers. Using Theorem 2,
we strengthen this bound as well.

Let $T \stackrel{\mathrm{def}}{=} \{(n,b) : M(y_n,b) = 0\}$ be the set of pairs $(n,b)$ corresponding to examples and binary
classifiers not used when designing the K-ary classifier, and $T^c \stackrel{\mathrm{def}}{=} \{(n,b) : M(y_n,b) \ne 0\}$ be its
complement. Clearly, $|T| + |T^c| = NB$. The number of examples misclassified by the binary classifiers
is then $\varepsilon_b \stackrel{\mathrm{def}}{=} |\{(n,b) \in T^c : h_b(x_n) \ne M(y_n,b)\}|$, and the error rate of the binary classifiers
is

$$\bar{\varepsilon}_b \stackrel{\mathrm{def}}{=} \frac{\varepsilon_b}{|T^c|}.$$
We need $d_1$ and $d_2$ to apply Theorem 2 for ECOCs with Hamming decoding. In this case,
$d_1 = 0.5B_0^{\min}$, where $B_0^{\min}$ is the minimum number of zero entries in a row. In order to conveniently
express $d_2$, let $O_{k,k'} \stackrel{\mathrm{def}}{=} \{b : M(k,b) \ne 0 \wedge M(k',b) \ne 0\}$ be the set of columns where both codewords
$M(k,\cdot)$ and $M(k',\cdot)$ have non-zero elements. Assume a partial Hamming distance that takes into
account only columns in $O_{k,k'}$ and let

$$\rho_1 = \min_{k \ne k'} \Big\{ 0.5 \sum_{b \in O_{k,k'}} \left(1 - M(k,b)M(k',b)\right) \Big\}$$

be the minimum of such partial distances. Note that for ECOC matrices without zero entries, like
the one-vs-rest, $\rho_1 = \rho$. The reason for using $\rho_1$ is to isolate the influence of zero entries in matrix
$M$. The following result can then be proved.


Lemma 3 For any ECOC with Hamming decoding,

$$d_2 \ge d_1 + \lceil \rho_1/2 \rceil.$$

Proof We are after the classifiers' output $h \in \{-1,+1\}^B$ that minimizes $d_2(h)$. Let codewords
$M(r,\cdot)$ and $M(s,\cdot)$ achieve $d_1(h)$ and $d_2(h)$, respectively. Define the following sets of columns:
$S_{00} = \{b : M(r,b) = 0 \wedge M(s,b) = 0\}$, $S_{01} = \{b : M(r,b) = 0 \wedge M(s,b) \ne 0\}$ and $S_{10} = \{b :
M(r,b) \ne 0 \wedge M(s,b) = 0\}$. For the entries corresponding to columns $b \in S_{00}$, $h$ can assume any
value, and for $b \in S_{01} \cup S_{10}$, $h$ will match the non-zero entry. Hence, $\forall h$,

$$d_1(h) + d_2(h) \ge |S_{00}| + 0.5(|S_{01}| + |S_{10}|) + \rho_1.$$

The distance $d_1(h)$ cannot be larger than $d_2(h)$, so

$$2d_2(h) \ge d_1(h) + d_2(h) \ge B_0^{\min} + \rho_1 = 2d_1 + \rho_1.$$

To properly take into account the case where $\rho_1$ is odd,

$$d_2(h) \ge d_1 + \lceil \rho_1/2 \rceil.$$





Let $\bar{B}_1 = |T^c|/N$ be the average number of non-zero entries in each codeword. Applying
Theorem 2 leads to the following result.

Lemma 4 For any ECOC with Hamming decoding,

$$\bar{\varepsilon}_K \le \frac{0.5(B - B_0^{\min} - \bar{B}_1) + \bar{B}_1\bar{\varepsilon}_b}{\lceil \rho_1/2 \rceil}.$$


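Proof For Hamming decoding,

$$D = 0.5|T| + \varepsilon_b = 0.5(NB - |T^c|) + |T^c|\bar{\varepsilon}_b.$$

So, $\bar{D} = 0.5(B - \bar{B}_1) + \bar{B}_1\bar{\varepsilon}_b$. From Theorem 2,

$$\bar{\varepsilon}_K \le \frac{0.5(B - \bar{B}_1) + \bar{B}_1\bar{\varepsilon}_b - d_1}{d_2 - d_1} \le \frac{0.5(B - B_0^{\min} - \bar{B}_1) + \bar{B}_1\bar{\varepsilon}_b}{\lceil \rho_1/2 \rceil},$$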


where the last step follows from Lemma 3.

3.3 Bounds for Hamming Decoding and $\alpha$ vs. $\beta$ Matrix


The bound in Lemma 4 becomes simpler when applied to ECOCs with $\alpha$ vs. $\beta$ matrices. For these
matrices the number $B$ of binary classifiers is $B_0 + B_1$, where $B_0 = B_0^{\min}$ and $B_1 = \bar{B}_1$ are the number
of zero and non-zero elements in each row, respectively. Applying Lemma 4 leads to

$$\bar{\varepsilon}_K \le \frac{B_1}{\lceil \rho_1/2 \rceil}\,\bar{\varepsilon}_b. \qquad (4)$$

For example, for one-vs-rest, Equation (4) implies

$$\bar{\varepsilon}_K \le K\bar{\varepsilon}_b, \qquad (5)$$

which was originally presented by Guruswami and Sahai (1999). And for all-pairs

$$\bar{\varepsilon}_K \le (K-1)\bar{\varepsilon}_b. \qquad (6)$$

We note that for $\alpha$ vs. $\beta$ matrices it can be proved that $d_2$ achieves the lower bound in Lemma 3,
namely $d_2 = d_1 + \lceil \rho_1/2 \rceil$.
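
Equation (6) can be illustrated with a small simulation for the all-pairs matrix. The sketch below uses synthetic hard decisions rather than the SVM outputs of Section 4: classifiers that saw the correct class err independently with a hypothetical probability `flip_prob`, and classifiers for which the correct class is unseen output an arbitrary sign. The function names and the lower-index-positive sign convention are assumptions of this sketch.

```python
import random
from itertools import combinations


def all_pairs_matrix(K):
    pairs = list(combinations(range(K), 2))
    return [[1 if k == i else -1 if k == j else 0 for (i, j) in pairs]
            for k in range(K)]


def hamming_distance(h, row):
    return 0.5 * sum(1 - a * b for a, b in zip(h, row))


def simulate_all_pairs(K, N, flip_prob, seed=0):
    """Return (K-ary error rate, binary error rate over the non-zero entries)."""
    rng = random.Random(seed)
    M = all_pairs_matrix(K)
    k_errors = b_errors = b_total = 0
    for _ in range(N):
        y = rng.randrange(K)
        h = []
        for m in M[y]:
            if m == 0:
                h.append(rng.choice((-1, 1)))  # unseen class: arbitrary output
            else:
                wrong = rng.random() < flip_prob
                h.append(-m if wrong else m)
                b_errors += wrong
                b_total += 1
        dists = [hamming_distance(h, row) for row in M]
        k_errors += min(range(K), key=dists.__getitem__) != y
    return k_errors / N, b_errors / b_total


if __name__ == "__main__":
    K = 5
    e_K, e_b = simulate_all_pairs(K, N=5000, flip_prob=0.1)
    print(f"K-ary error {e_K:.3f} <= (K - 1) * binary error {(K - 1) * e_b:.3f}")
```

Only the $K-1$ classifiers that involve the correct class enter the binary error rate, which matches the definition of $\bar{\varepsilon}_b$ over the non-zero entries.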

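To apply Equation (4) for a general $\alpha$ vs. $\beta$ matrix $M$, it is convenient to have expressions for $B_1$
and $\rho_1$. These can be written in terms of $B^*$ and $\rho^*$, which are parameters obtained from the base
matrix $M^*$ of $M$, defined as follows.

We construct an $\alpha$ vs. $\beta$ matrix $M$ using matrices

$$M^* \in \{-1,+1\}^{(\alpha+\beta) \times B^*} \quad \text{and} \quad P \in \{0,1,\ldots,\alpha+\beta\}^{K \times \binom{K}{\alpha+\beta}}.$$

Given the values $K$, $\alpha$ and $\beta$ of $M$, the associated $M^*$ is simply an $\alpha$ vs. $\beta$ ECOC matrix with the
same values $\alpha^* = \alpha$ and $\beta^* = \beta$, but with the number of classes $K^* = \alpha + \beta$. Clearly, if $\alpha + \beta = K$
(i.e., if there are no zero entries in $M$), then $M^* = M$. The matrix $P$ is used to expand $M^*$ into $M$,
taking into account all $\binom{K}{\alpha+\beta}$ ways of choosing $\alpha + \beta$ of the $K$ classes. Each entry $P(m,n) = i$, $i \ne 0$,
is replaced by the $i$-th row of $M^*$. If $P(m,n) = 0$, the entry is substituted by $B^*$ zeros. For example,
for a 1 vs. 2 ECOC matrix with $K = 4$, the matrices $M^*$, $P$ and $M$ are, respectively,

$$\begin{pmatrix} +1 & -1 & -1 \\ -1 & +1 & -1 \\ -1 & -1 & +1 \end{pmatrix}, \quad
\begin{pmatrix} 1 & 1 & 1 & 0 \\ 2 & 2 & 0 & 1 \\ 3 & 0 & 2 & 2 \\ 0 & 3 & 3 & 3 \end{pmatrix}, \quad
\left(\begin{array}{rrr|rrr|rrr|rrr}
+1 & -1 & -1 & +1 & -1 & -1 & +1 & -1 & -1 & 0 & 0 & 0 \\
-1 & +1 & -1 & -1 & +1 & -1 & 0 & 0 & 0 & +1 & -1 & -1 \\
-1 & -1 & +1 & 0 & 0 & 0 & -1 & +1 & -1 & -1 & +1 & -1 \\
0 & 0 & 0 & -1 & -1 & +1 & -1 & -1 & +1 & -1 & -1 & +1
\end{array}\right).$$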





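The number of columns in the base matrix $M^*$ is

$$B^* = \frac{1}{1 + I(\alpha = \beta)} \binom{\alpha+\beta}{\alpha},$$

where the indicator function $I(\cdot)$ (which is 1 if its argument is true and zero otherwise) takes into
account that, when $\alpha = \beta$, half of the $\binom{\alpha+\beta}{\alpha}$ binary problems are effectively the same. And the
minimum Hamming distance for $M^*$ is

$$\rho^* = \frac{2}{1 + I(\alpha = \beta)} \binom{\alpha+\beta-2}{\alpha-1},$$

where we assumed $\beta \ge \alpha$. For example, for all-pairs $M^* = \begin{pmatrix} +1 \\ -1 \end{pmatrix}$, $B^* = 1$ and $\rho^* = 1$. Having $B^*$
and $\rho^*$ for the base matrix $M^*$, one can obtain

$$B_1 = \binom{K-1}{\alpha+\beta-1} B^* \quad \text{and} \quad \rho_1 = \binom{K-2}{\alpha+\beta-2} \rho^*,$$

which allow the use of Equation (4) for any $\alpha$ vs. $\beta$ ECOC.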
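
The construction of an $\alpha$ vs. $\beta$ matrix can be sketched in Python by enumerating the subsets of $\alpha + \beta$ classes directly, which is equivalent to expanding a base matrix over all subsets. The function names are hypothetical and classes are indexed from 0. The closed-form counts in `predicted_B1_rho1` are a counting argument assumed for this sketch (subsets containing one given class, respectively two given classes, times the base-matrix quantities) and are checked against the explicit matrix.

```python
from itertools import combinations
from math import comb


def alpha_vs_beta_matrix(K, alpha, beta):
    """One column per choice of alpha positive and beta negative classes.

    When alpha == beta, a split and its mirror image are counted once.
    """
    columns = []
    for subset in combinations(range(K), alpha + beta):
        for pos in combinations(subset, alpha):
            if alpha == beta and subset[0] not in pos:
                continue  # mirror image of a column already generated
            columns.append([1 if k in pos else -1 if k in subset else 0
                            for k in range(K)])
    return [list(row) for row in zip(*columns)]


def nonzeros_per_row(M):
    """Set of the numbers of non-zero entries found in the rows of M."""
    return {sum(1 for v in row if v != 0) for row in M}


def rho_1(M):
    """Minimum partial Hamming distance over columns where both rows are non-zero."""
    return min(sum(0.5 * (1 - a * b) for a, b in zip(r, s) if a != 0 and b != 0)
               for r, s in combinations(M, 2))


def predicted_B1_rho1(K, alpha, beta):
    """Closed-form B_1 and rho_1 from the base-matrix quantities (assumed here)."""
    halve = 2 if alpha == beta else 1
    n = alpha + beta
    b_star = comb(n, alpha) // halve
    rho_star = 2 * comb(n - 2, alpha - 1) // halve
    return comb(K - 1, n - 1) * b_star, comb(K - 2, n - 2) * rho_star


if __name__ == "__main__":
    M = alpha_vs_beta_matrix(4, 1, 2)
    for row in M:
        print(row)
    print(nonzeros_per_row(M), rho_1(M), predicted_B1_rho1(4, 1, 2))
```

For the 1 vs. 2 case with $K = 4$ the generated matrix has 12 columns, grouped in four blocks of three, one block per subset of three classes.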

Based on this result, we now briefly discuss the behavior of $\alpha$ vs. $\beta$ ECOCs when the number
$K$ of classes is large. Allwein et al. (2000) stated Equation (3) as $\bar{\varepsilon}_K \le B\varepsilon/(\rho L^*)$, where $\varepsilon = \bar{D}/B$ is the
average distance per binary classifier (note that $\varepsilon$ does not explicitly take into account the influence of
zero entries in $M$). We note that for all-pairs, as $K$ grows, the proportion $B_0/B$ of zeros in each row
goes to 1, and $\varepsilon$ goes to $L^*$, which makes the bound trivial. This can be easily seen for Hamming
decoding. In this case,

$$\bar{\varepsilon}_K \le \frac{B\varepsilon_H}{\rho L^*} = \frac{2B\varepsilon_H}{\rho},$$

where $\varepsilon_H$ is the average Hamming distance per binary classifier (Corollary 2 in Allwein et al., 2000).
For all-pairs, $\rho = (B+1)/2$, so $2B/\rho = 4B/(B+1)$ and, for large enough $K$, $\bar{\varepsilon}_K \le 4\varepsilon_H$, but also
$\varepsilon_H \approx L^* = 0.5$.

Alternatively, we can look at the behavior for large $K$ of ECOCs with $\alpha$ vs. $\beta$ and Hamming
decoding using Equation (4), i.e., using $B_1/\lceil \rho_1/2 \rceil$. We note that, in spite of not being achieved by
all-pairs, there are ECOCs with $\alpha$ vs. $\beta$ matrix and Hamming decoding for which $B_1/\lceil \rho_1/2 \rceil \to 4$, as
$K \to \infty$. When $\alpha = \beta = K/2$, Equation (4) leads to $\bar{\varepsilon}_K \le 4(K-1)\bar{\varepsilon}_b/K$. This is the same asymptotic
behavior achieved by Hadamard matrices (Guruswami and Sahai, 1999), but $\alpha$ vs. $\beta$ matrices may
correspond to a much larger number $B$ of classifiers.
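
The factor multiplying the binary error in Equation (4) can be tabulated for the two cases discussed above. The sketch below assumes, for $\alpha = \beta = K/2$ with even $K$, that the matrix has no zero entries, $\binom{K}{K/2}/2$ columns, and minimum row distance $\binom{K-2}{K/2-1}$; these counts and the function names are assumptions of this sketch.

```python
from fractions import Fraction
from math import comb


def factor_half_split(K):
    """B_1 / ceil(rho_1 / 2) for alpha = beta = K/2 (K even, no zero entries)."""
    a = K // 2
    B1 = comb(K, a) // 2          # mirror-image splits counted once
    rho1 = comb(K - 2, a - 1)     # columns where two given rows disagree
    return Fraction(B1, -(-rho1 // 2))  # -(-x // 2) is ceil(x / 2)


def factor_all_pairs(K):
    """B_1 / ceil(rho_1 / 2) for all-pairs: (K - 1) / 1."""
    return Fraction(K - 1, 1)


if __name__ == "__main__":
    for K in (4, 6, 8, 10, 20):
        print(K, float(factor_half_split(K)), float(factor_all_pairs(K)))
```

The first factor stays below 4 and approaches it as $K$ grows, while the all-pairs factor grows linearly with $K$.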


4. Experimental Results 


In this section we investigate the individual performance of binary classifiers for different ECOCs.
We are interested in evaluating whether the bounds in Equations (5) and (6) are tight, and in using them
to get insight into the multiclass performance. Previous work (e.g., Allwein et al., 2000, Hsu and
Lin, 2002) compared the all-pairs and one-vs-rest matrices in terms of multiclass error, and here
we concentrate on the performance of the binary classifiers. Our experimental setup is
also well suited to explaining why elimination (Kreßel, 1999), which is described below, and Hamming
decoding are often equivalent in terms of accuracy.


4.1 The Elimination Decoding Method for All-Pairs 


The elimination decoding method applies only to ECOCs with the all-pairs matrix and quantized
scores $h(x)$. This decoder was originally described by Kreßel (1999) and independently reintroduced
by Platt et al. (2000), where it was called directed acyclic graph SVM (DAGSVM) when
SVM is the binary learner. It operates iteratively and, at each iteration $n = 1,2,\ldots,K-2$, the
size of the set $A_n = \{h_{i,j}\}$ of active binary classifiers $h$ is decreased. The set $A_1$ contains all
binary classifiers, namely $|A_1| = B$. At iteration $n$, the output of a classifier $h_{l,m} \in A_n$ is computed,
the losing class $t \in \{l,m\}$ is eliminated, and so are all classifiers related to it, namely
$A_{n+1} = A_n - \{h_{i,j} : i = t \vee j = t\}$. The set $A_{K-1}$ contains only one binary classifier, which determines
the winner class. When compared to Hamming, elimination decoding can lead to substantial
savings given that $K-1$ binary classifiers are consulted, instead of $\binom{K}{2}$. Platt et al. (2000) found the
ordering of classifiers $h_{i,j}$ to be not important and adopted: $(i,j) = (1,2),(1,3),\ldots,(1,K),(2,3),
\ldots,(K-1,K)$, which was also used here.

Table 1: Datasets used for the experiments (columns: name, # train, # test, # classes, # attributes,
average # training examples per class).

It is clear that Hamming and elimination decoding can in general lead to different results. For
example, it may happen in elimination decoding that the class with smallest Hamming distance is
prematurely eliminated, and the class with the largest Hamming distance is declared the winner. We
note that, if there is a class that wins against all other $K-1$ classes, Hamming and elimination decoding
lead to the same classification result.
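
The two decoders can be compared on toy inputs with the following sketch. It is a simplified reading of the elimination procedure: at each step the first active pair in lexicographic order is consulted and the loser is dropped. The function names, the 0-based class indices and the convention that a positive output of the classifier for the pair (i, j) means class i wins are assumptions of this sketch.

```python
def hamming_winner(h, K):
    """Voting: the class with most pairwise wins (lowest index wins ties)."""
    wins = [0] * K
    for (i, j), out in h.items():
        wins[i if out > 0 else j] += 1
    return max(range(K), key=wins.__getitem__)


def elimination_winner(h, K):
    """Return (winner, number of binary classifiers consulted)."""
    active = list(range(K))
    consulted = 0
    while len(active) > 1:
        i, j = active[0], active[1]  # first active pair in lexicographic order
        loser = j if h[(i, j)] > 0 else i
        active.remove(loser)
        consulted += 1
    return active[0], consulted


if __name__ == "__main__":
    # Class 2 beats every other class; the remaining outcomes are arbitrary.
    h = {(0, 1): 1, (0, 2): -1, (0, 3): -1, (1, 2): -1, (1, 3): 1, (2, 3): 1}
    print(hamming_winner(h, 4), elimination_winner(h, 4))
    # A cycle (0 beats 1, 1 beats 2, 2 beats 0): the decoders may disagree.
    cycle = {(0, 1): 1, (1, 2): 1, (0, 2): -1}
    print(hamming_winner(cycle, 3), elimination_winner(cycle, 3))
```

For all-pairs, the Hamming distance to a class equals a constant plus its number of lost comparisons, so Hamming decoding reduces to counting wins, as done in `hamming_winner`.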


4.2 Datasets and Experimental Setup 


We evaluated the performance of different ECOCs using the eleven standard datasets listed in Ta- 
ble 1. The datasets soybean-large, vowel, isolet, letter, satimage and pendigits are available at the 
UCI repository, with associated documentation. The other five datasets are related to speech
recognition. In order to facilitate reproducing our results, these datasets and their descriptions were made
available on the Web,³ and here we present only a brief summary. The vowel-lsf is a version of
vowel, obtained by a non-linear transformation (log-area ratios to line spectral frequencies) that is
standard in speech coding. The e-set is a subset of isolet consisting of the confusable letters {B, C,
D, E, G, P, T, V, Z}. The two versions of the Peterson and Barney's vowel data, namely pbvowelF0-3
and pbvowelF1-2, are described by Klautau (2002). The timit-plp40 dataset is a version of TIMIT,⁴
a speech database with phonetic transcriptions. We used 12 perceptual linear prediction (PLP)
coefficients and energy to represent each frame (10 milliseconds). As phones have different durations,
we linearly warped them into three regions, and took the average of each region to obtain a vector
with fixed length (Ganapathiraju et al., 1998). After adding the phone duration (number of frames),
we obtained 3 × 13 + 1 = 40 features. We collapsed the 61 TIMIT labels into the standard 39 classes
proposed by Kai-Fu Lee. 

All datasets have a standard partition into training and test sets, which were used throughout the 
experiments. For each binary training set, the attributes were normalized to the range [0,1] based on 
their minimum and maximum values, and the same normalization factors were used for the test set. 
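
The normalization step can be sketched as follows. This is an illustrative implementation with hypothetical function names; mapping constant attributes to 0 is a choice made here to avoid a division by zero and is not taken from the text.

```python
def fit_min_max(train):
    """Per-attribute minimum and maximum over the training instances."""
    columns = list(zip(*train))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(data, lo, hi):
    """Scale each attribute with the training range; constant attributes map to 0."""
    return [[(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
            for row in data]


if __name__ == "__main__":
    train = [[0, 10], [2, 20], [4, 30]]
    test = [[6, 15]]
    lo, hi = fit_min_max(train)
    print(apply_min_max(train, lo, hi))
    print(apply_min_max(test, lo, hi))
```

Because the factors come from the training set only, test attributes can fall outside [0,1], as the first attribute of the test instance does in this example.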


3. http://speech.ucsd.edu/aldebaro/repository. 
4. http://www.ldc.upenn.edu/Catalog/top_ten.html.

                                                                 K-ary error (%)
Dataset         ECOC matrix   SVM parameters                     (1-z)+     Hamming    elimination
                              (polynomial kernel,
                              unless noted)
soybean-large   all-pairs     δ = 1, E = 2, C = 0.1              10.1       10.1       10.6
                one-vs-rest   linear, C = 1                      6.6 (+)    8.5        -
vowel           all-pairs     RBF, γ = 1, C = 10                 37.7       34.6 (+)   33.1
                one-vs-rest   RBF, γ = 10, C = 10                41.3       68.6       -
vowel-lsf       all-pairs     RBF, γ = 1, C = 1                  32.2       29.9 (+)   30.3
                one-vs-rest   RBF, γ = 10, C = 10                39.2       66.0       -
pbvowelF1-2     all-pairs     RBF, γ = 10, C = 1                 19.0       19.0       19.0
                one-vs-rest   RBF, γ = 10, C = 10                18.7       28.5       -
pbvowelF0-3     all-pairs     δ = 1, E = 4, C = 1                11.2       10.7       10.8
                one-vs-rest   RBF, γ = 10, C = 10                12.2       15.9       -
[...]           all-pairs     δ = 0, E = 3, C = 10
                one-vs-rest   δ = 0, E = 4, C = 0.1
[...]           all-pairs     δ = 0, E = 4, C = 1                5.6        6.3        5.9
                one-vs-rest   δ = 1, E = 4, C = 1                5.6        8.5        -
letter          all-pairs     RBF, γ = 10, C = 10
                one-vs-rest   RBF, γ = 10, C = 10
satimage        all-pairs     RBF, γ = 1, C = 10
                one-vs-rest   RBF, γ = 10, C = 1
pendigits       all-pairs     RBF, γ = 1, C = 10                 2.1        1.6        1.6
                one-vs-rest   RBF, γ = 1, C = 10                 1.1 (+)    2.1        -
timit-plp40     all-pairs     RBF, γ = 1, C = 1                  31.3       25.9       26.0
                one-vs-rest   RBF, γ = 4, C = 1                  27.0       42.9       -

Table 2: Comparison of one-vs-rest and all-pairs matrices in terms of accuracy. The all-pairs with
Hamming and one-vs-rest with $L(z) = (1-z)_+$ (max-wins) decoding were compared
through McNemar's test, with a symbol (+) indicating the two ECOCs are not equivalent.


The binary learner was the SVM with either the polynomial $K(x,y) = (x \cdot y + \delta)^E$ or Gaussian
radial-basis function (RBF) $K(x,y) = e^{-\gamma\|x-y\|^2}$ kernel. We used the same SVM parameters for
all binary classifiers of a given ECOC matrix. Because we were interested in comparing the
performance of binary classifiers using different ECOCs, we chose the complexity parameter $C$ and kernel
parameters according to performance on the test set. Therefore, our results should not be interpreted
as indicating generalization error. More specifically, for each ECOC matrix we tested all decoders
using the set of parameters that achieved the smallest error with any decoding method. If different
sets of SVM parameters achieved the smallest error, we chose the parameters that minimized
the number of distinct support vectors. This methodology differs from the one adopted by Platt
et al. (2000), Hsu and Lin (2002), where different SVM parameters could be used for Hamming and
elimination, making it harder to identify the reason for their similar accuracy.
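
For reference, the two kernels can be written as plain Python functions. This is a sketch with hypothetical function names that operates on lists of numbers; it is not the SVM implementation used in the experiments.

```python
from math import exp


def polynomial_kernel(x, y, delta, E):
    """K(x, y) = (x . y + delta)^E."""
    return (sum(a * b for a, b in zip(x, y)) + delta) ** E


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


if __name__ == "__main__":
    x, y = [1, 2], [3, 4]
    print(polynomial_kernel(x, y, delta=0, E=1))  # plain inner product
    print(polynomial_kernel(x, y, delta=1, E=2))
    print(rbf_kernel(x, y, gamma=1))
```

With `delta=0` and `E=1` the polynomial kernel reduces to the inner product, i.e., a linear SVM.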


4.3 Results 


Table 2 shows the results comparing the K-ary error of ECOCs with one-vs-rest and all-pairs. For
the all-pairs matrix, elimination achieved accuracy similar to Hamming decoding, while $L(z) =
(1-z)_+$ was slightly worse. For the one-vs-rest matrix, max-wins has much better performance
than Hamming decoding, as expected.


5. Using δ = 0 and E = 1 leads to a linear SVM, which can be converted to a perceptron to avoid storing the support
vectors and save computations.

Our main goal is to evaluate the binary classifiers, but we note that Table 2 indicates that
quantizing the scores may be beneficial when using all-pairs (or other matrices with zero entries). In
contrast, Allwein et al. (2000) concluded that $L(z) = (1-z)_+$ often gives better results than Hamming
decoding for the all-pairs matrix. For example, they reported that, for ECOCs using all-pairs
and SVMs with polynomial kernel of order 4, decoding with $L(z) = (1-z)_+$ and Hamming led to
27.5% and 50.4% of error, respectively, for the satimage dataset. As we used the test set to perform
model selection, their results should not be compared directly to Table 2. However, when we used
all-pairs and Hamming with the SVM parameters δ = 1, E = 4 and C = 1, the error rate for satimage
was 11.0%.

We attribute the fact that Hamming outperforms $L(z) = (1-z)_+$ to the large number of unseen
classes for binary classifiers of all-pairs. If $M(k,b) = 0$, we say that class $k$ is unseen with respect
to classifier $f_b$. During the test stage, all instances associated with an unseen class $k$ lead $f_b$ to make
potentially erratic predictions. All-pairs is the $\alpha$ vs. $\beta$ matrix with the largest number of unseen
classes per binary classifier. In this case, the scores are unreliable and quantizing them to $\{-1,+1\}$
can lead to higher accuracy.

At this point we assume Hamming and max-wins as the decoders for all-pairs and one-vs-rest, 
respectively, and compare the accuracy of these two ECOCs using McNemar’s test (see Dietterich, 
1998) (0.05 significance level). As shown in Table 2, McNemar’s test indicated that the two clas- 
sifiers were equivalent for 7 out of the 11 datasets. We now investigate the performance of the 
individual binary classifiers, trying to characterize the situations where one ECOC outperforms the 
other. This analysis is not required in order to understand the numbers for the soybean-large dataset 
though, which confirm that all-pairs may perform poorly if there is not enough training data for all 
classifiers. 
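
McNemar's test, in the continuity-corrected form described by Dietterich (1998), can be sketched as below. The function names are hypothetical, and the threshold 3.841 is the 0.95 quantile of the chi-square distribution with one degree of freedom, matching the 0.05 significance level.

```python
CHI2_1DOF_05 = 3.841  # 0.95 quantile of chi-square with one degree of freedom


def disagreement_counts(pred_a, pred_b, labels):
    """n01: only A is wrong; n10: only B is wrong."""
    n01 = sum(a != y and b == y for a, b, y in zip(pred_a, pred_b, labels))
    n10 = sum(a == y and b != y for a, b, y in zip(pred_a, pred_b, labels))
    return n01, n10


def mcnemar_statistic(n01, n10):
    """Continuity-corrected statistic (|n01 - n10| - 1)^2 / (n01 + n10)."""
    if n01 + n10 == 0:
        return 0.0
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)


def classifiers_differ(pred_a, pred_b, labels):
    """True if the null hypothesis of equal error rates is rejected at 0.05."""
    return mcnemar_statistic(*disagreement_counts(pred_a, pred_b, labels)) > CHI2_1DOF_05


if __name__ == "__main__":
    labels = [0] * 12
    pred_a = [1] * 10 + [0] * 2   # A is wrong on the first 10 instances
    pred_b = [0] * 10 + [1] * 2   # B is wrong on the last 2 instances
    print(disagreement_counts(pred_a, pred_b, labels))
    print(classifiers_differ(pred_a, pred_b, labels))
```

Only the instances on which exactly one of the two classifiers errs enter the statistic; instances where both are right or both are wrong carry no information for the comparison.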

Table 3 shows the performance on the test set of the binary classifiers corresponding to the
ECOCs in Table 2. Besides some statistics of the binary error that we will discuss later, Table 3
presents histograms of Hamming distances $d_H(M(k^*,\cdot),h(x))$ between quantized scores $h(x)$ and
codeword $M(k^*,\cdot)$, where $k^*$ is the winner class. For each dataset, the sum of the six right-most
columns is equal to the total number of test instances. These six columns were split into two
subsets, depending on whether Hamming decoding led to a K-ary error or not. For example, for the
soybean-large dataset and one-vs-rest matrix, 344 test instances were correctly classified (sum of 3
columns under "when match") and 32 were misclassified (columns "when error"), corresponding to
the K-ary error of 8.5% in Table 2. In this case, all binary classifiers made the correct decision for
343 instances (column "when match / 0"). For one test instance, the Hamming distance was one,
but the instance was correctly classified (column "when match / 1"). Among the instances that led
to errors, there were 13 for which $d_H = 0$.

When $d_H = 0$ for one-vs-rest (only one binary classifier has a positive score), max-wins and
Hamming decoding lead to the same decision. Table 3 shows that, for one-vs-rest, most of the
K-ary errors occurred with the winner class leading to $d_H = 1$, while only 5 out of 21,399 test
instances led to $d_H > 1$. Hence, in almost all cases, the results with max-wins differed from the
ones with Hamming when the quantized scores $h(x)$ led to either a tie between two or among all
$K$ classes. In these cases, max-wins could use the magnitudes of scores as a tie-breaking rule,
outperforming Hamming decoding as shown in Table 2. For the isolet dataset for example, among
the 144 instances that led to $d_H > 0$, 109 and 42 were correctly classified by max-wins and Hamming
decoding, respectively. In this case, around half of the K-ary errors were associated with $h(x)$ with
two positive entries. Assuming the correct class was among the two competing, a random guess had
a 50% error rate. For the cases where $h_b(x) = -1$, $\forall b$, Hamming decoding had to randomly break
the tie among $K = 26$ classes.

6. We are using the term unseen classes to denote a problem that has been discussed in the literature related to
all-pairs. For example, Hastie and Tibshirani (1998) conducted an experiment with artificial data to characterize it, and
mentioned that Geoffrey Hinton originally pointed out the problem.

Table 3: Performance of the binary classifiers associated with the ECOCs in Table 2. The six
right-most columns are histograms of Hamming distances $d_H$ between quantized scores $h(x)$ and
the codeword $M(k^*,\cdot)$, where $k^*$ is the winner class. For all-pairs, the constant $0.5\binom{K-1}{2}$
was subtracted from $d_H$.

For all-pairs, the constant $0.5\binom{K-1}{2}$ corresponding to $\binom{K-1}{2}$ zero entries in $M$ was subtracted
from $d_H$ in Table 3. In this case, there was a class that won according to all of its $K-1$ binary
classifiers for 99.1% of the test instances when considering all datasets.⁷ If we look at each dataset
individually, the percentage varies from 93.1% (vowel) to 99.9% (pendigits). This percentage of
unanimous decisions can explain why elimination and Hamming decoding perform similarly in
terms of accuracy (Platt et al., 2000, Hsu and Lin, 2002).

We now evaluate how tight the bounds on the multiclass error $\bar{\varepsilon}_K$ are, and how they can help to
understand the ECOC performance. It can be seen from Table 3 that $\bar{\varepsilon}_b$ is lower for one-vs-rest only
for soybean-large and pendigits, which are the two datasets for which one-vs-rest outperformed
all-pairs. For vowel and vowel-lsf (for which all-pairs achieved higher accuracy), $\bar{\varepsilon}_b$ for one-vs-rest is
higher than for all-pairs by a factor of 1.37 and 1.65, respectively. In spite of these facts, a careful
evaluation indicates that the binary error $\bar{\varepsilon}_b$ alone does not suffice to predict the K-ary error. We
elaborate as follows.

7. Kreßel (1999) noted this behavior in his experiments.

NEAREST-NEIGHBOR ECOC WITH APPLICATION TO ALL-PAIRS MULTICLASS SVM











[Figure 1: bar chart with one pair of bars per dataset, for all-pairs and one-versus-rest.]

Figure 1: Comparison of upper bounds and empirical results for ECOCs with Hamming decoding.
The numbers (in percentage) correspond to the division of results in Table 2 by the
upper bounds on average number of K-ary errors $\bar{\varepsilon}_K$ for Equations (5) and (6). A result of
100% would correspond to an ECOC achieving in practice the upper bound on $\bar{\varepsilon}_K$.


Figure 1 shows that the bounds (5) and (6) on the K-ary error $\bar{\varepsilon}_K$ for Hamming decoding are close
to experimental results. These bounds, and consequently the binary error $\bar{\varepsilon}_b$, can be effectively used
to predict $\bar{\varepsilon}_K$ when using the Hamming decoder. In practice however, we want to use max-wins
decoding for one-vs-rest, for which $\bar{\varepsilon}_b$ alone cannot predict performance. For example, for isolet,
one-vs-rest has a binary error $\bar{\varepsilon}_b$ that is 2.5 times higher than $\bar{\varepsilon}_b$ for all-pairs, but still achieves
slightly better K-ary error $\bar{\varepsilon}_K$.


5. Conclusions 


We presented new bounds on the K-ary error of ECOCs with nearest-neighbor decoding. We then
specialized the bounds for Hamming decoding and $\alpha$ vs. $\beta$ matrices. We showed that for large
enough $K$, $\alpha$ vs. $\beta$ matrices with $\alpha = \beta = K/2$ have the same behavior as Hadamard matrices. We
also conducted simulations to evaluate the bounds and compare ECOCs based on one-vs-rest and
all-pairs matrices.

The conclusions of these experiments can be summarized as follows. The bounds are relatively 
tight, and accurately predict the multiclass error based on the performance of the binary classifiers 
when using Hamming decoding. Quantizing the scores (as in Hamming decoding) can be benefi- 
cial for ECOCs with the all-pairs matrix, and we attribute this to the influence of unseen classes, 
for which the scores are unreliable. Hamming and elimination decoding achieved equivalent
performance for all datasets, and we explained that these two decoders lead to the same classification
result when one class wins according to all of its $K-1$ binary classifiers, which is the case for
99.1% of our test instances.